COBS: A Background

The Comparison of Biofuels Systems (COBS) is a field trial and collaboration between Iowa State University and the University of Illinois. The COBS experiment seeks to compare the impacts of different biofuel production systems. Of particular interest is the impact of the production system on carbon cycling and soil health. In order to investigate this impact, a unique soil sampling was performed in 2012, so that metagenomes could be assembled and used to explore the distribution of carbon cycling genes.

What made this soil sampling unique was the non-destructive separation of the soil into aggregate fractions. In short, the soil was sieved at 4 °C using sieves that correspond to the aggregate fractions of interest.

Previous investigations of this data set have shown that bacterial communities differ across aggregate fractions and cropping systems.

The above image is a figure taken from previous work on our data set. It shows how closely related the bacterial communities of the aggregate fractions and the cropping systems are. The further apart two points are on this graph, the more dissimilar their communities are.

What is an aggregate?

Soil aggregates are groups of soil particles that bind to each other more strongly than to adjacent particles. The space between the aggregates provides pore space for retention and exchange of air and water.

What is a metagenome?

A metagenome is a library constructed from the DNA extracted from all organisms in a sample, in this case the soil.

Now What?

So we have a unique data set that consists of all the DNA from soil aggregate fractions from the COBS experiment. In addition to the genomic data, we started with metadata that was not tidy.

Our challenge was to use the metagenome libraries to construct abundance tables that quantify the abundance of bacterial species capable of performing a step in the decomposition of cellulose. To facilitate our analysis of these data in R, we had to tidy up the COBS metadata. The metadata consists of many chemical and physical characteristics of the soil samples from the COBS aggregate fractions.

To investigate the presence of cellulolytic bacteria, we needed to generate a list of bacterial sequences associated with cellulose degradation. We did this by searching the NCBI database for amino acid sequences associated with one of three enzymes. The enzymes we are interested in are all involved in the breakdown of cellulose in the soil. Once we have this list, we can compare the aggregate fractions for the abundance of cellulolytic bacteria.

We can do this by quantifying the amount of bacteria that have genes associated with carbon degrading enzymes of interest. These enzymes are:

  1. (BG) beta-glucosidase [EC:3.2.1.21] is a cellobiase
  2. (BX) beta-D-xylosidase 4 [EC:3.2.1.37] is a xylosidase that acts on xylan (hemicellulose) fragments
  3. (CB) 1,4-beta-cellobiosidase [EC:3.2.1.91] is an exocellulase

Together these enzymes help break down plant cell-wall polysaccharides such as cellulose into simple sugars like glucose, which are an important energy source for the microbial community in the soil.

To generate gene counts for each enzyme in each soil sample, we first had to generate files that contain protein sequences associated with the enzymes mentioned above. We tried two different approaches for generating these data. One attempted to automate the process by searching the NCBI database directly for nucleotide sequences associated with our enzymes. However, the script did not perform as well as manually searching NCBI and downloading the protein sequences: it returned fewer sequences that we could use to query the databases than the manual search did. The failed script for searching NCBI is shown below. We plan to troubleshoot this script further in hopes of improving its performance.

import sys
from Bio import Entrez, SeqIO

Entrez.email = 'jflater@iastate.edu'

# First, find entries that contain the E.C. number
ec_num = sys.argv[1].strip()
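# Search on the bare number only (e.g. 3.2.1.21), without an "EC" or "E.C." prefix.
# Note: esearch returns at most 20 IDs unless retmax is set, which may be one
# reason this script finds fewer sequences than a manual search.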
esearch_handle = Entrez.esearch(db='nucleotide', term=ec_num)
entries = Entrez.read(esearch_handle)
esearch_handle.close()

# Second, fetch these entries
efetch_handle = Entrez.efetch(db='nucleotide', id=entries['IdList'], rettype='gb', retmode='xml') 
records = Entrez.parse(efetch_handle)

# Now, we go through the records and look for a feature with name 'EC_number'
for record in records:
    for feature in record['GBSeq_feature-table']:
        for subfeature in feature['GBFeature_quals']:
            if (subfeature['GBQualifier_name'] == 'EC_number' and
                    subfeature['GBQualifier_value'] == ec_num):

                # If we found it, we extract the seq's start and end
                accession = record['GBSeq_primary-accession']
                interval = feature['GBFeature_intervals'][0]
                interval_start = interval['GBInterval_from']
                interval_end = interval['GBInterval_to']
                location = feature['GBFeature_location']
                # Strand 2 requests the complement (minus) strand, 1 the plus strand
                if location.startswith('complement'):
                    strand = 2
                else:
                    strand = 1

                # Now we fetch the nucleotide sequence
                handle = Entrez.efetch(db="nucleotide", id=accession,
                                       rettype="fasta", strand=strand,
                                       seq_start=interval_start,
                                       seq_stop=interval_end)
                seq = SeqIO.read(handle, "fasta")
                handle.close()

                print('>GenBank Accession:{}'.format(accession))
                print(seq.seq)
efetch_handle.close()
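
The qualifier-matching step of the script can be tried offline, without contacting NCBI. The sketch below is only an illustration: `find_ec_features` is a hypothetical helper that is not part of the original script, and the example record is invented rather than downloaded. It assumes records shaped like the dictionaries the script reads from the GenBank XML, and it uses `.get` so that a feature without qualifiers, or a qualifier without a value, is skipped instead of raising a `KeyError`.

```python
def find_ec_features(record, ec_num):
    """Return (accession, start, end, strand) for features tagged with ec_num.

    `record` is assumed to be a dict shaped like one parsed GenBank XML entry.
    """
    hits = []
    accession = record.get('GBSeq_primary-accession')
    for feature in record.get('GBSeq_feature-table', []):
        for qual in feature.get('GBFeature_quals', []):
            if (qual.get('GBQualifier_name') == 'EC_number'
                    and qual.get('GBQualifier_value') == ec_num):
                interval = feature['GBFeature_intervals'][0]
                location = feature.get('GBFeature_location', '')
                # Same convention as the script: 2 = complement, 1 = plus strand
                strand = 2 if location.startswith('complement') else 1
                hits.append((accession,
                             interval['GBInterval_from'],
                             interval['GBInterval_to'],
                             strand))
    return hits


# Invented record for illustration only (not real NCBI data)
example_record = {
    'GBSeq_primary-accession': 'XX000001',
    'GBSeq_feature-table': [
        {'GBFeature_key': 'source'},  # a feature with no qualifiers at all
        {
            'GBFeature_key': 'CDS',
            'GBFeature_location': '10..400',
            'GBFeature_intervals': [
                {'GBInterval_from': '10', 'GBInterval_to': '400'},
            ],
            'GBFeature_quals': [
                {'GBQualifier_name': 'pseudo'},  # a qualifier with no value
                {'GBQualifier_name': 'EC_number',
                 'GBQualifier_value': '3.2.1.21'},
            ],
        },
    ],
}

print(find_ec_features(example_record, '3.2.1.21'))
```

Whether missing qualifiers actually occur in the records returned for our EC numbers has not been checked; the sketch only shows one way to make the matching step tolerant of them.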

We run that script on this file:

ec_numbers<- read.table("data/ec_numbers.txt")
head(ec_numbers)
##         V1
## 1  3.2.1.4
## 2 3.2.1.91
## 3 3.2.1.21
## 4 3.2.1.37
## 5 3.2.1.41
## 6 3.2.1.86

By using a while loop:

while read -r line; do
  python scripts/nucl_from_ec.py "$line" > "$line".txt
done < ec_numbers.txt

It’s important to remember that this script did not work. We only include it to show our attempt at partial automation and to give ourselves an incentive to debug the code further and improve its performance.

See BLAST.Rmd for how to generate count table when starting from EC numbers.

We didn’t use this method because the manual search returned more protein sequences than the script did; we believe the cause is an issue with the package needed to download from NCBI. Fewer protein sequences would mean that we would potentially find fewer matches in the RefSeq database, which we use to find all sequences similar to a given protein sequence. Therefore, the manual method was used. It is described in BLAST.Rmd, along with all scripts associated with generating the gene count tables for each enzyme.

The summary counts for each enzyme that the BLAST.Rmd pipeline generates look like this:

summary_count_21<- read.table("data/summary_counts/summary-count-21.tsv")
head(summary_count_21)
##                A10_GATCAG_L006_R1_001 A10_GATCAG_L006_R2_001
## IF31_RS0116175                      0                      1
## K314_RS0111100                      0                      0
## PSNIH1_RS09265                      0                      0
## B072_RS0125095                      0                      0
## G618_RS0100660                      0                      0
## TU94_RS12250                        2                      0
##                A11_TAGCTT_L006_R1_001 A11_TAGCTT_L006_R2_001
## IF31_RS0116175                      0                      0
## K314_RS0111100                      2                      0
## PSNIH1_RS09265                      0                      0
## B072_RS0125095                      0                      0
## G618_RS0100660                      0                      0
## TU94_RS12250                        3                      1
##                A12_GGCTAC_L006_R1_001 A12_GGCTAC_L006_R2_001
## IF31_RS0116175                      0                      0
## K314_RS0111100                      1                      1
## PSNIH1_RS09265                      0                      0
## B072_RS0125095                      1                      0
## G618_RS0100660                      0                      0
## TU94_RS12250                        3                      2
##                A1_CGATGT_L007_R1_001 A1_CGATGT_L007_R2_001
## IF31_RS0116175                     0                     0
## K314_RS0111100                     0                     0
## PSNIH1_RS09265                     0                     0
## B072_RS0125095                     0                     0
## G618_RS0100660                     0                     0
## TU94_RS12250                       2                     2
##                A2_TGACCA_L007_R1_001 A2_TGACCA_L007_R2_001
## IF31_RS0116175                     0                     0
## K314_RS0111100                     0                     0
## PSNIH1_RS09265                     0                     0
## B072_RS0125095                     0                     0
## G618_RS0100660                     0                     0
## TU94_RS12250                       0                     1
##                A3_ACAGTG_L007_R1_001 A3_ACAGTG_L007_R2_001
## IF31_RS0116175                     1                     0
## K314_RS0111100                     0                     0
## PSNIH1_RS09265                     0                     0
## B072_RS0125095                     0                     0
## G618_RS0100660                     0                     0
## TU94_RS12250                       0                     1
##                A4_GCCAAT_L007_R1_001 A4_GCCAAT_L007_R2_001
## IF31_RS0116175                     0                     1
## K314_RS0111100                     0                     0
## PSNIH1_RS09265                     0                     0
## B072_RS0125095                     0                     0
## G618_RS0100660                     0                     0
## TU94_RS12250                       1                     2
##                A5_CAGATC_L007_R1_001 A5_CAGATC_L007_R2_001
## IF31_RS0116175                     1                     0
## K314_RS0111100                     0                     0
## PSNIH1_RS09265                     0                     0
## B072_RS0125095                     0                     0
## G618_RS0100660                     0                     0
## TU94_RS12250                       0                     0
##                A6_CTTGTA_L007_R1_001 A6_CTTGTA_L007_R2_001
## IF31_RS0116175                     1                     0
## K314_RS0111100                     0                     0
## PSNIH1_RS09265                     0                     0
## B072_RS0125095                     0                     0
## G618_RS0100660                     1                     1
## TU94_RS12250                       1                     0
##                A7_ATCACG_L007_R1_001 A7_ATCACG_L007_R2_001
## IF31_RS0116175                     0                     0
## K314_RS0111100                     0                     0
## PSNIH1_RS09265                     0                     0
## B072_RS0125095                     0                     0
## G618_RS0100660                     0                     0
## TU94_RS12250                       1                     0
##                A8_TTAGGC_L007_R1_001 A8_TTAGGC_L007_R2_001
## IF31_RS0116175                     0                     0
## K314_RS0111100                     0                     0
## PSNIH1_RS09265                     0                     0
## B072_RS0125095                     0                     0
## G618_RS0100660                     0                     0
## TU94_RS12250                       1                     0
##                A9_ACTTGA_L006_R1_001 A9_ACTTGA_L006_R2_001
## IF31_RS0116175                     0                     0
## K314_RS0111100                     0                     0
## PSNIH1_RS09265                     0                     0
## B072_RS0125095                     0                     0
## G618_RS0100660                     0                     0
## TU94_RS12250                       0                     0
##                B10_GAGTGG_L007_R1_001 B10_GAGTGG_L007_R2_001
## IF31_RS0116175                      0                      0
## K314_RS0111100                      0                      0
## PSNIH1_RS09265                      0                      0
## B072_RS0125095                      0                      0
## G618_RS0100660                      0                      0
## TU94_RS12250                        1                      0
##                B11_ACTGAT_L007_R1_001 B11_ACTGAT_L007_R2_001
## IF31_RS0116175                      0                      0
## K314_RS0111100                      0                      0
## PSNIH1_RS09265                      0                      0
## B072_RS0125095                      0                      0
## G618_RS0100660                      0                      0
## TU94_RS12250                        1                      0
##                B12_ATTCCT_L007_R1_001 B12_ATTCCT_L007_R2_001
## IF31_RS0116175                      1                      0
## K314_RS0111100                      0                      0
## PSNIH1_RS09265                      0                      0
## B072_RS0125095                      0                      0
## G618_RS0100660                      0                      0
## TU94_RS12250                        1                      0
##                B1_AGTCAA_L006_R1_001 B1_AGTCAA_L006_R2_001
## IF31_RS0116175                     0                     1
## K314_RS0111100                     0                     0
## PSNIH1_RS09265                     0                     0
## B072_RS0125095                     0                     0
## G618_RS0100660                     0                     0
## TU94_RS12250                       0                     1
##                B2_AGTTCC_L006_R1_001 B2_AGTTCC_L006_R2_001
## IF31_RS0116175                     2                     2
## K314_RS0111100                     0                     1
## PSNIH1_RS09265                     0                     0
## B072_RS0125095                     0                     0
## G618_RS0100660                     0                     0
## TU94_RS12250                       2                     0
##                B3_ATGTCA_L006_R1_001 B3_ATGTCA_L006_R2_001
## IF31_RS0116175                     0                     0
## K314_RS0111100                     0                     0
## PSNIH1_RS09265                     0                     0
## B072_RS0125095                     0                     0
## G618_RS0100660                     0                     0
## TU94_RS12250                       0                     0
##                B4_CCGTCC_L006_R1_001 B4_CCGTCC_L006_R2_001
## IF31_RS0116175                     1                     0
## K314_RS0111100                     0                     0
## PSNIH1_RS09265                     0                     0
## B072_RS0125095                     1                     0
## G618_RS0100660                     0                     0
## TU94_RS12250                       0                     1
##                B5_GTCCGC_L007_R1_001 B5_GTCCGC_L007_R2_001
## IF31_RS0116175                     0                     0
## K314_RS0111100                     0                     0
## PSNIH1_RS09265                     0                     0
## B072_RS0125095                     0                     0
## G618_RS0100660                     0                     0
## TU94_RS12250                       0                     1
##                B6_GTGAAA_L007_R1_001 B6_GTGAAA_L007_R2_001
## IF31_RS0116175                     0                     0
## K314_RS0111100                     0                     0
## PSNIH1_RS09265                     1                     0
## B072_RS0125095                     0                     0
## G618_RS0100660                     0                     0
## TU94_RS12250                       0                     3
##                B7_GTGGCC_L007_R1_001 B7_GTGGCC_L007_R2_001
## IF31_RS0116175                     0                     0
## K314_RS0111100                     0                     0
## PSNIH1_RS09265                     0                     0
## B072_RS0125095                     0                     0
## G618_RS0100660                     0                     0
## TU94_RS12250                       1                     3
##                B8_GTTTCG_L007_R1_001 B8_GTTTCG_L007_R2_001
## IF31_RS0116175                     0                     0
## K314_RS0111100                     0                     0
## PSNIH1_RS09265                     0                     0
## B072_RS0125095                     0                     0
## G618_RS0100660                     0                     0
## TU94_RS12250                       1                     0
##                B9_CGTACG_L007_R1_001 B9_CGTACG_L007_R2_001
## IF31_RS0116175                     0                     0
## K314_RS0111100                     0                     0
## PSNIH1_RS09265                     0                     0
## B072_RS0125095                     0                     0
## G618_RS0100660                     0                     0
## TU94_RS12250                       1                     0
##                H10_CCGTCC_L008_R1_001 H10_CCGTCC_L008_R2_001
## IF31_RS0116175                      0                      0
## K314_RS0111100                      0                      0
## PSNIH1_RS09265                      0                      0
## B072_RS0125095                      0                      0
## G618_RS0100660                      0                      0
## TU94_RS12250                        0                      0
##                H11_GTCCGC_L008_R1_001 H11_GTCCGC_L008_R2_001
## IF31_RS0116175                      0                      0
## K314_RS0111100                      0                      0
## PSNIH1_RS09265                      0                      0
## B072_RS0125095                      0                      0
## G618_RS0100660                      0                      0
## TU94_RS12250                        1                      2
##                H12_GTGAAA_L008_R1_001 H12_GTGAAA_L008_R2_001
## IF31_RS0116175                      0                      0
## K314_RS0111100                      0                      0
## PSNIH1_RS09265                      0                      0
## B072_RS0125095                      0                      1
## G618_RS0100660                      0                      0
## TU94_RS12250                        3                      2
##                H13_ATGTCA_L007_R1_001 H13_ATGTCA_L007_R2_001
## IF31_RS0116175                      0                      0
## K314_RS0111100                      0                      0
## PSNIH1_RS09265                      0                      0
## B072_RS0125095                      0                      0
## G618_RS0100660                      0                      0
## TU94_RS12250                        0                      0
##                H14_ACTTGA_L007_R1_001 H14_ACTTGA_L007_R2_001
## IF31_RS0116175                      0                      0
## K314_RS0111100                      3                      1
## PSNIH1_RS09265                      1                      0
## B072_RS0125095                      0                      0
## G618_RS0100660                      0                      0
## TU94_RS12250                        0                      5
##                H1_CGATGT_L006_R1_001 H1_CGATGT_L006_R2_001
## IF31_RS0116175                     0                     0
## K314_RS0111100                     1                     0
## PSNIH1_RS09265                     0                     0
## B072_RS0125095                     0                     0
## G618_RS0100660                     0                     0
## TU94_RS12250                       1                     1
##                H3_ACAGTG_L006_R1_001 H3_ACAGTG_L006_R2_001
## IF31_RS0116175                     0                     0
## K314_RS0111100                     0                     0
## PSNIH1_RS09265                     0                     0
## B072_RS0125095                     0                     0
## G618_RS0100660                     0                     0
## TU94_RS12250                       1                     0
##                H4_GCCAAT_L006_R1_001 H4_GCCAAT_L006_R2_001
## IF31_RS0116175                     0                     0
## K314_RS0111100                     0                     0
## PSNIH1_RS09265                     0                     0
## B072_RS0125095                     0                     1
## G618_RS0100660                     0                     0
## TU94_RS12250                       5                     3
##                H5_CAGATC_L006_R1_001 H5_CAGATC_L006_R2_001
## IF31_RS0116175                     0                     0
## K314_RS0111100                     0                     0
## PSNIH1_RS09265                     0                     0
## B072_RS0125095                     0                     0
## G618_RS0100660                     0                     0
## TU94_RS12250                       0                     0
##                H6_CTTGTA_L007_R1_001 H6_CTTGTA_L007_R2_001
## IF31_RS0116175                     0                     0
## K314_RS0111100                     0                     0
## PSNIH1_RS09265                     0                     0
## B072_RS0125095                     0                     0
## G618_RS0100660                     0                     0
## TU94_RS12250                       0                     0
##                H8_AGTTCC_L007_R1_001 H8_AGTTCC_L007_R2_001
## IF31_RS0116175                     0                     0
## K314_RS0111100                     0                     0
## PSNIH1_RS09265                     0                     0
## B072_RS0125095                     0                     0
## G618_RS0100660                     0                     0
## TU94_RS12250                       0                     2
##                H9_ATCACG_L008_R1_001 H9_ATCACG_L008_R2_001
## IF31_RS0116175                     0                     0
## K314_RS0111100                     0                     0
## PSNIH1_RS09265                     0                     0
## B072_RS0125095                     0                     0
## G618_RS0100660                     0                     0
## TU94_RS12250                       1                     3
##                Hofmocke17_AGTTCC_L005_R1_001 Hofmocke17_AGTTCC_L005_R2_001
## IF31_RS0116175                             0                             0
## K314_RS0111100                             0                             0
## PSNIH1_RS09265                             0                             0
## B072_RS0125095                             0                             0
## G618_RS0100660                             0                             0
## TU94_RS12250                               0                             0
##                Hofmockel15_GATCAG_L003_R1_001
## IF31_RS0116175                              0
## K314_RS0111100                              0
## PSNIH1_RS09265                              0
## B072_RS0125095                              0
## G618_RS0100660                              0
## TU94_RS12250                                0
##                Hofmockel15_GATCAG_L003_R2_001
## IF31_RS0116175                              0
## K314_RS0111100                              0
## PSNIH1_RS09265                              0
## B072_RS0125095                              0
## G618_RS0100660                              0
## TU94_RS12250                                0
##                Hofmockel16_TAGCTT_L003_R1_001
## IF31_RS0116175                              1
## K314_RS0111100                              0
## PSNIH1_RS09265                              0
## B072_RS0125095                              0
## G618_RS0100660                              0
## TU94_RS12250                                1
##                Hofmockel16_TAGCTT_L003_R2_001
## IF31_RS0116175                              0
## K314_RS0111100                              0
## PSNIH1_RS09265                              0
## B072_RS0125095                              0
## G618_RS0100660                              0
## TU94_RS12250                                2
##                Hofmockel18_GTGGCC_L005_R1_001
## IF31_RS0116175                              0
## K314_RS0111100                              0
## PSNIH1_RS09265                              0
## B072_RS0125095                              0
## G618_RS0100660                              0
## TU94_RS12250                                0
##                Hofmockel18_GTGGCC_L005_R2_001
## IF31_RS0116175                              0
## K314_RS0111100                              0
## PSNIH1_RS09265                              0
## B072_RS0125095                              0
## G618_RS0100660                              0
## TU94_RS12250                                0
##                Hofmockel19_GTTTCG_L005_R1_001
## IF31_RS0116175                              0
## K314_RS0111100                              0
## PSNIH1_RS09265                              0
## B072_RS0125095                              0
## G618_RS0100660                              0
## TU94_RS12250                                0
##                Hofmockel19_GTTTCG_L005_R2_001
## IF31_RS0116175                              0
## K314_RS0111100                              0
## PSNIH1_RS09265                              0
## B072_RS0125095                              0
## G618_RS0100660                              0
## TU94_RS12250                                0
##                Hofmockel20_CGTACG_L005_R1_001
## IF31_RS0116175                              0
## K314_RS0111100                              0
## PSNIH1_RS09265                              0
## B072_RS0125095                              0
## G618_RS0100660                              0
## TU94_RS12250                                0
##                Hofmockel20_CGTACG_L005_R2_001
## IF31_RS0116175                              0
## K314_RS0111100                              0
## PSNIH1_RS09265                              0
## B072_RS0125095                              0
## G618_RS0100660                              0
## TU94_RS12250                                1
##                Hofmockel2_TGACCA_L003_R1_001 Hofmockel2_TGACCA_L003_R2_001
## IF31_RS0116175                             0                             0
## K314_RS0111100                             0                             0
## PSNIH1_RS09265                             0                             0
## B072_RS0125095                             0                             0
## G618_RS0100660                             0                             0
## TU94_RS12250                               1                             1
##                Hofmockel30_GAGTGG_L006_R1_001
## IF31_RS0116175                              0
## K314_RS0111100                              0
## PSNIH1_RS09265                              0
## B072_RS0125095                              0
## G618_RS0100660                              0
## TU94_RS12250                                0
##                Hofmockel30_GAGTGG_L006_R2_001
## IF31_RS0116175                              0
## K314_RS0111100                              0
## PSNIH1_RS09265                              0
## B072_RS0125095                              0
## G618_RS0100660                              0
## TU94_RS12250                                0
##                Hofmockel34_ACTGAT_L006_R1_001
## IF31_RS0116175                              0
## K314_RS0111100                              0
## PSNIH1_RS09265                              0
## B072_RS0125095                              0
## G618_RS0100660                              0
## TU94_RS12250                                0
##                Hofmockel34_ACTGAT_L006_R2_001
## IF31_RS0116175                              0
## K314_RS0111100                              0
## PSNIH1_RS09265                              0
## B072_RS0125095                              0
## G618_RS0100660                              0
## TU94_RS12250                                0
##                Hofmockel35_ATTCCT_L006_R1_001
## IF31_RS0116175                              0
## K314_RS0111100                              0
## PSNIH1_RS09265                              0
## B072_RS0125095                              0
## G618_RS0100660                              0
## TU94_RS12250                                0
##                Hofmockel35_ATTCCT_L006_R2_001
## IF31_RS0116175                              0
## K314_RS0111100                              0
## PSNIH1_RS09265                              0
## B072_RS0125095                              0
## G618_RS0100660                              0
## TU94_RS12250                                0
##                Hofmockel39_CGATGT_L006_R1_001
## IF31_RS0116175                              1
## K314_RS0111100                              0
## PSNIH1_RS09265                              0
## B072_RS0125095                              0
## G618_RS0100660                              0
## TU94_RS12250                                0
##                Hofmockel39_CGATGT_L006_R2_001
## IF31_RS0116175                              0
## K314_RS0111100                              0
## PSNIH1_RS09265                              0
## B072_RS0125095                              0
## G618_RS0100660                              0
## TU94_RS12250                                0
##                Hofmockel41_TGACCA_L001_R1_001
## IF31_RS0116175                              0
## K314_RS0111100                              0
## PSNIH1_RS09265                              0
## B072_RS0125095                              0
## G618_RS0100660                              0
## TU94_RS12250                                0
##                Hofmockel41_TGACCA_L001_R2_001
## IF31_RS0116175                              0
## K314_RS0111100                              0
## PSNIH1_RS09265                              0
## B072_RS0125095                              0
## G618_RS0100660                              0
## TU94_RS12250                                1
##                Hofmockel43_ACAGTG_L001_R1_001
## IF31_RS0116175                              0
## K314_RS0111100                              0
## PSNIH1_RS09265                              0
## B072_RS0125095                              0
## G618_RS0100660                              0
## TU94_RS12250                                1
##                Hofmockel43_ACAGTG_L001_R2_001
## IF31_RS0116175                              0
## K314_RS0111100                              0
## PSNIH1_RS09265                              0
## B072_RS0125095                              0
## G618_RS0100660                              0
## TU94_RS12250                                2
##                Hofmockel57_GCCAAT_L001_R1_001
## IF31_RS0116175                              2
## K314_RS0111100                              0
## PSNIH1_RS09265                              0
## B072_RS0125095                              0
## G618_RS0100660                              0
## TU94_RS12250                                3
##                Hofmockel57_GCCAAT_L001_R2_001
## IF31_RS0116175                              0
## K314_RS0111100                              0
## PSNIH1_RS09265                              0
## B072_RS0125095                              0
## G618_RS0100660                              0
## TU94_RS12250                                0
##                Hofmockel59_CAGATC_L001_R1_001
## IF31_RS0116175                              1
## K314_RS0111100                              0
## PSNIH1_RS09265                              0
## B072_RS0125095                              0
## G618_RS0100660                              0
## TU94_RS12250                                1
##                Hofmockel59_CAGATC_L001_R2_001
## IF31_RS0116175                              1
## K314_RS0111100                              0
## PSNIH1_RS09265                              0
## B072_RS0125095                              0
## G618_RS0100660                              0
## TU94_RS12250                                0
##                Hofmockel7_AGTCAA_L003_R1_001 Hofmockel7_AGTCAA_L003_R2_001
## IF31_RS0116175                             0                             0
## K314_RS0111100                             0                             0
## PSNIH1_RS09265                             0                             0
## B072_RS0125095                             0                             0
## G618_RS0100660                             0                             0
## TU94_RS12250                               1                             0
ncol(summary_count_21)
## [1] 104
nrow(summary_count_21)
## [1] 6215

As you can see, the summary_count table for this enzyme has one column per metagenome from our original dataset. The column names differ from our sample names, and each sample appears twice (as R1 and R2); we will tackle both issues in the final analysis. The table has 104 columns and 6,215 rows! Each row holds the number of sequences matching one gene of interest (e.g. IF31_RS0116175) that were found in each metagenome.

The rows of each table were then summed to give a single count per metagenome for that enzyme. With three enzymes, we ended up adding 3 columns to our metadata.
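
A minimal sketch of the summing step, assuming it amounts to a column-wise sum over genes. The toy table below copies three rows of the Hofmockel57 columns from the output above, so its totals are illustrative only; the names `toy_counts`, `enzyme_totals` and `total_count` are made up for this example and are not the ones used in the analysis.

```r
# Toy stand-in for one summary count table:
# rows are genes, columns are metagenomes.
toy_counts <- data.frame(
  Hofmockel57_GCCAAT_L001_R1_001 = c(2, 0, 3),
  Hofmockel57_GCCAAT_L001_R2_001 = c(0, 0, 0),
  row.names = c("IF31_RS0116175", "K314_RS0111100", "TU94_RS12250")
)

# Sum over genes (rows) to get one total per metagenome (column).
totals <- colSums(toy_counts)

# Store the totals in a long table, one row per metagenome, so that they
# could later be joined onto the metadata as a new column.
enzyme_totals <- data.frame(
  metagenome = names(totals),
  total_count = as.numeric(totals),
  stringsAsFactors = FALSE
)
enzyme_totals
```

Repeating this for each of the three count tables would give the three extra metadata columns, once the metagenome names have been matched to sample names.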

Data Analysis

We are starting with these files:

  1. The metadata: “../data/KBase_MGRast_Metadata_9May2013_EMB.csv”
  2. The three count tables, one per enzyme: “../data/summary_counts/summary-count-**.tsv” (there are three files in the summary_counts folder)
  3. The sample link file that connects our metagenome names to the sample names: “../data/SampleLink4.csv”
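
One way to read all of the count tables in a single step is sketched below. It assumes each .tsv file has a header row and gene identifiers in its first column, which is a guess at the layout rather than something stated above. So that the example runs anywhere, it writes two tiny made-up files to a temporary folder; in the project, `count_dir` would point at the summary_counts folder instead.

```r
# Create two tiny tab-separated files in a temporary folder (made-up data).
count_dir <- file.path(tempdir(), "summary_counts")
dir.create(count_dir, showWarnings = FALSE)
toy <- data.frame(gene = c("geneA", "geneB"),
                  sample1 = c(0, 2),
                  sample2 = c(1, 0))
for (id in c("01", "02")) {
  write.table(toy,
              file.path(count_dir, paste0("summary-count-", id, ".tsv")),
              sep = "\t", quote = FALSE, row.names = FALSE)
}

# Find every count table by its file name pattern.
count_files <- list.files(count_dir,
                          pattern = "^summary-count-.*\\.tsv$",
                          full.names = TRUE)

# Read each file, using the first column as row names (assumed layout)
# and leaving the column names exactly as they appear in the file.
count_tables <- lapply(count_files, read.delim,
                       row.names = 1, check.names = FALSE)
names(count_tables) <- basename(count_files)

# Number of rows and columns in each table.
sapply(count_tables, dim)
```

Keeping the tables in a named list makes it easy to apply the same steps to each enzyme in turn.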

Loading data and setting working directory

cobs_data <- read.csv("data/KBase_MGRast_Metadata_9May2013_EMB.csv", stringsAsFactors = FALSE)
head(cobs_data)
##                                                   sample_name
## 1 Unique name or id of the sample the library is derived from
## 2                                            CC12-LM-July2012
## 3                                         CC12-Micro-July2012
## 4                                            CC12-MM-July2012
## 5                                            CC12-SM-July2012
## 6                                            CC12-WS-July2012
##   month_sampled year_sampled     crop_system       sample_block
## 1 month sampled year sampled cropping system experimental block
## 2          July         2012            corn                  1
## 3          July         2012            corn                  1
## 4          July         2012            corn                  1
## 5          July         2012            corn                  1
## 6          July         2012            corn                  1
##             agg_frac                       mgrast_id
## 1 aggregate fraction MG-RAST Enviromental Package ID
## 2                 LM                        mge94361
## 3              micro                        mge94364
## 4                 MM                        mge94367
## 5                 SM                        mge94370
## 6                 WS                        mge94373
##                                                                                                                                                                                                                                                                               agrochem_addition
## 1                                                                                                                                                                                                                   addition of fertilizers, pesticides, etc. - amount and time of applications
## 2 72 lb N/acre liquid urea ammonia nitrate (32% solution) applied at planting (2012-5-11); Spray 28 oz/ac  Roundup PowerMax (49% glyphosate) + 19 oz/ac Outlook (2012-5-12);  100 lb N/acre liquid urea ammonia nitrate (32% solution) + 28 oz/ac Roundup PowerMax (49% glyphosate) (2012-6-12)
## 3 72 lb N/acre liquid urea ammonia nitrate (32% solution) applied at planting (2012-5-11); Spray 28 oz/ac  Roundup PowerMax (49% glyphosate) + 19 oz/ac Outlook (2012-5-12);  100 lb N/acre liquid urea ammonia nitrate (32% solution) + 28 oz/ac Roundup PowerMax (49% glyphosate) (2012-6-12)
## 4 72 lb N/acre liquid urea ammonia nitrate (32% solution) applied at planting (2012-5-11); Spray 28 oz/ac  Roundup PowerMax (49% glyphosate) + 19 oz/ac Outlook (2012-5-12);  100 lb N/acre liquid urea ammonia nitrate (32% solution) + 28 oz/ac Roundup PowerMax (49% glyphosate) (2012-6-12)
## 5 72 lb N/acre liquid urea ammonia nitrate (32% solution) applied at planting (2012-5-11); Spray 28 oz/ac  Roundup PowerMax (49% glyphosate) + 19 oz/ac Outlook (2012-5-12);  100 lb N/acre liquid urea ammonia nitrate (32% solution) + 28 oz/ac Roundup PowerMax (49% glyphosate) (2012-6-12)
## 6 72 lb N/acre liquid urea ammonia nitrate (32% solution) applied at planting (2012-5-11); Spray 28 oz/ac  Roundup PowerMax (49% glyphosate) + 19 oz/ac Outlook (2012-5-12);  100 lb N/acre liquid urea ammonia nitrate (32% solution) + 28 oz/ac Roundup PowerMax (49% glyphosate) (2012-6-12)
##                                                   crop_rotation
## 1 whether or not crop is rotated, and if yes, rotation schedule
## 2                                                            no
## 3                                                            no
## 4                                                            no
## 5                                                            no
## 6                                                            no
##                   cur_land_use
## 1 present state of sample site
## 2         row crop agriculture
## 3         row crop agriculture
## 4         row crop agriculture
## 5         row crop agriculture
## 6         row crop agriculture
##                                                                                     cur_vegetation
## 1 vegetation classification from one or more standard classification systems, or agricultural crop
## 2                                                                                                7
## 3                                                                                                7
## 4                                                                                                7
## 5                                                                                                7
## 6                                                                                                7
##                                     cur_vegetation_meth
## 1 reference or method used in vegetation classification
## 2                                                 USNVC
## 3                                                 USNVC
## 4                                                 USNVC
## 5                                                 USNVC
## 6                                                 USNVC
##                                                           drainage_class
## 1 drainage classification from a standard system such as the USDA system
## 2                                                         poorly drained
## 3                                                         poorly drained
## 4                                                         poorly drained
## 5                                                         poorly drained
## 6                                                         poorly drained
##                                                          extreme_event
## 1 unusual physical events that may have affected microbial populations
## 2                                                    2012 drought year
## 3                                                    2012 drought year
## 4                                                    2012 drought year
## 5                                                    2012 drought year
## 6                                                    2012 drought year
##                                                                      fao_class
## 1 soil classification from the FAO World Reference Database for Soil Resources
## 2                                                                     Phaeozem
## 3                                                                     Phaeozem
## 4                                                                     Phaeozem
## 5                                                                     Phaeozem
## 6                                                                     Phaeozem
##                                          fire
## 1 historical and/or physical evidence of fire
## 2                                          no
## 3                                          no
## 4                                          no
## 5                                          no
## 6                                          no
##                                                                                                                                                             horizon
## 1 specific layer in the land area which measures parallel to the soil surface and possesses physical characteristics which differ from the layers above and beneath
## 2                                                                                                                                                                Ap
## 3                                                                                                                                                                Ap
## 4                                                                                                                                                                Ap
## 5                                                                                                                                                                Ap
## 6                                                                                                                                                                Ap
##                                          horizon_meth
## 1 reference or method used in determining the horizon
## 2                                    USDA Soil Survey
## 3                                    USDA Soil Survey
## 4                                    USDA Soil Survey
## 5                                    USDA Soil Survey
## 6                                    USDA Soil Survey
##                                                        link_class_info
## 1 link to digitized soil maps or other soil classification information
## 2            http://websoilsurvey.nrcs.usda.gov/app/WebSoilSurvey.aspx
## 3            http://websoilsurvey.nrcs.usda.gov/app/WebSoilSurvey.aspx
## 4            http://websoilsurvey.nrcs.usda.gov/app/WebSoilSurvey.aspx
## 5            http://websoilsurvey.nrcs.usda.gov/app/WebSoilSurvey.aspx
## 6            http://websoilsurvey.nrcs.usda.gov/app/WebSoilSurvey.aspx
##                                                     local_class
## 1 soil classification based on local soil classification system
## 2                                                    Endoaquoll
## 3                                                    Endoaquoll
## 4                                                    Endoaquoll
## 5                                                    Endoaquoll
## 6                                                    Endoaquoll
##                                                        local_class_meth
## 1 reference or method used in determining the local soil classification
## 2                                                             USDA NRCS
## 3                                                             USDA NRCS
## 4                                                             USDA NRCS
## 5                                                             USDA NRCS
## 6                                                             USDA NRCS
##                       mgrast_id.1
## 1 MG-RAST Enviromental Package ID
## 2                                
## 3                                
## 4                                
## 5                                
## 6                                
##                                                                                                                                                                                                                                    microbial_biomass
## 1 the part of the organic matter in the soil that constitutes living microorganisms smaller than 5-10 _µm. IF you keep this, you would need to have correction factors used for conversion to the final units, which should be mg C (or N)/kg soil).
## 2                                                                                                                                                                                                                                                   
## 3                                                                                                                                                                                                                                                   
## 4                                                                                                                                                                                                                                                   
## 5                                                                                                                                                                                                                                                   
## 6                                                                                                                                                                                                                                                   
##                                      microbial_biomass_meth
## 1 reference or method used in determining microbial biomass
## 2                                                          
## 3                                                          
## 4                                                          
## 5                                                          
## 6                                                          
##                                                                        misc_param
## 1 any other measurement performed or parameter collected, that is not listed here
## 2                                                                                
## 3                                                                                
## 4                                                                                
## 5                                                                                
## 6                                                                                
##               ph                                    ph_meth
## 1 pH measurement reference or method used in determining pH
## 2           6.77                     pH meter, 1:2 soil:H2O
## 3           6.77                     pH meter, 1:2 soil:H2O
## 4           6.77                     pH meter, 1:2 soil:H2O
## 5           6.77                     pH meter, 1:2 soil:H2O
## 6           6.77                     pH meter, 1:2 soil:H2O
##                                pool_dna_extracts
## 1 were multiple DNA extractions mixed? how many?
## 2                                             no
## 3                                             no
## 4                                             no
## 5                                             no
## 6                                             no
##             previous_land_use
## 1 previous land use and dates
## 2            row crop, ?-2007
## 3            row crop, ?-2007
## 4            row crop, ?-2007
## 5            row crop, ?-2007
## 6            row crop, ?-2007
##                                                previous_land_use_meth
## 1 reference or method used in determining previous land use and dates
## 2                                                                    
## 3                                                                    
## 4                                                                    
## 5                                                                    
## 6                                                                    
##                                                                                                             profile_position
## 1 cross-sectional position in the hillslope where sample was collected.sample area position in relation to surrounding areas
## 2                                                                                                                           
## 3                                                                                                                           
## 4                                                                                                                           
## 5                                                                                                                           
## 6                                                                                                                           
##                                      salinity_meth
## 1 reference or method used in determining salinity
## 2                                                 
## 3                                                 
## 4                                                 
## 5                                                 
## 6                                                 
##            samp_weight_dna_ext
## 1 weight (g) of soil processed
## 2                             
## 3                             
## 4                             
## 5                             
## 6                             
##                                                                              sieving
## 1 collection design of pooled samples and/or sieve size and amount of sample sieved 
## 2                                                                             8-2 mm
## 3                                                                             2-1 mm
## 4                                                                          1-0.25 mm
## 5                                                                           <0.25 mm
## 6                                                                               8 mm
##                                                                                                                                                                                                                                                                              slope_aspect
## 1 the direction a slope faces. While looking down a slope use a compass to record the direction you are facing (direction or degrees); e.g., NW or 315_\u008d.  This measure provides an indication of sun and wind exposure that will influence soil temperature and evapotranspiration.
## 2                                                                                                                                                                                                                                                                                        
## 3                                                                                                                                                                                                                                                                                        
## 4                                                                                                                                                                                                                                                                                        
## 5                                                                                                                                                                                                                                                                                        
## 6                                                                                                                                                                                                                                                                                        
##                                                                                                                                                                                                                          slope_gradient
## 1 commonly called ___slope.__\u009d  The angle between ground surface and a horizontal line (in percent).  This is the direction that overland water would flow.  This measure is usually taken with a hand level meter or clinometer. 
## 2                                                                                                                                                                                                                                  0-3%
## 3                                                                                                                                                                                                                                  0-3%
## 4                                                                                                                                                                                                                                  0-3%
## 5                                                                                                                                                                                                                                  0-3%
## 6                                                                                                                                                                                                                                  0-3%
##                                              soil_type
## 1 soil series name or other lower-level classification
## 2                              Webster silty clay loam
## 3                              Webster silty clay loam
## 4                              Webster silty clay loam
## 5                              Webster silty clay loam
## 6                              Webster silty clay loam
##                                                                                 soil_type_meth
## 1 reference or method used in determining soil series name or other lower-level classification
## 2                                                                                    USDA NRCS
## 3                                                                                    USDA NRCS
## 4                                                                                    USDA NRCS
## 5                                                                                    USDA NRCS
## 6                                                                                    USDA NRCS
##                                                                       store_cond
## 1 explain how and for how long the soil sample was stored before DNA extraction.
## 2                                                            (-80C) for 3 months
## 3                                                            (-80C) for 3 months
## 4                                                            (-80C) for 3 months
## 5                                                            (-80C) for 3 months
## 6                                                            (-80C) for 3 months
##                                                                                                                                                                                                                                                 texture
## 1 the relative proportion of different grain sizes of mineral particles in a soil, as described using a standard system; express as % sand (50 um to 2 mm), silt (2 um to 50 um), and clay (<2 um) with textural name (e.g., silty clay loam) optional.
## 2                                                                                                                                                                                                                                      32.2, 37.4, 30.4
## 3                                                                                                                                                                                                                                      32.2, 37.4, 30.5
## 4                                                                                                                                                                                                                                      32.2, 37.4, 30.6
## 5                                                                                                                                                                                                                                      32.2, 37.4, 30.7
## 6                                                                                                                                                                                                                                      32.2, 37.4, 30.8
##                                           texture_meth
## 1 reference or method used in determining soil texture
## 2                                              pipette
## 3                                              pipette
## 4                                              pipette
## 5                                              pipette
## 6                                              pipette
##                           tillage
## 1 note method(s) used for tilling
## 2                              no
## 3                              no
## 4                              no
## 5                              no
## 6                              no
##                                                     tot_n
## 1 total nitrogen content of the soil Units of g N/kg soil
## 2                                             0.103445426
## 3                                              0.09566322
## 4                                             0.104709536
## 5                                             0.103238307
## 6                                             0.098348401
##                                            tot_n_meth
## 1 reference or method used in determining the total N
## 2                                      dry combustion
## 3                                      dry combustion
## 4                                      dry combustion
## 5                                      dry combustion
## 6                                      dry combustion
##                                            tot_org_c_meth
## 1 reference or method used in determining total organic C
## 2                                                        
## 3                                                        
## 4                                                        
## 5                                                        
## 6                                                        
##                                                                                                                        tot_org_carb
## 1 Definition for soil: total organic C content of the soil units of g C/kg soil. Definition otherwise: total organic carbon content
## 2                                                                                                                                  
## 3                                                                                                                                  
## 4                                                                                                                                  
## 5                                                                                                                                  
## 6                                                                                                                                  
##               water_content_soil
## 1 water content (g/g or cm3/cm3)
## 2                    0.131073446
## 3                    0.094882729
## 4                    0.126931567
## 5                    0.116926503
## 6                    0.132502831
##                                             water_content_soil_meth
## 1 reference or method used in determining the water content of soil
## 2                                                       gravimetric
## 3                                                       gravimetric
## 4                                                       gravimetric
## 5                                                       gravimetric
## 6                                                       gravimetric
##   misc_param.1
## 1 total C mg/g
## 2  5.237405777
## 3  4.352434158
## 4  4.894068241
## 5  4.682502747
## 6  4.796504021
##                                                                      misc_param_1
## 1 any other measurement performed or parameter collected, that is not listed here
## 2                                                               env_package: soil
## 3                                                               env_package: soil
## 4                                                               env_package: soil
## 5                                                               env_package: soil
## 6                                                               env_package: soil
##              misc_param_10 misc_param_11          misc_param_12
## 1 MBN ugN g-1 dry soil CFE    MBN gN m-2  ExtC ugC g-1 dry soil
## 2              15.29833436   1.714440509            39.63016527
## 3              15.29833436   1.714440509            39.63016527
## 4              15.29833436   1.714440509            39.63016527
## 5              15.29833436   1.714440509            39.63016527
## 6              15.29833436   1.714440509            39.63016527
##          misc_param_13          misc_param_14        misc_param_15
## 1 Extractable C gC m-2  ExtC ugN g-1 dry soil Extractable N gC m-2
## 2            4.4412391            13.37876886          1.499320303
## 3            4.4412391            13.37876886          1.499320303
## 4            4.4412391            13.37876886          1.499320303
## 5            4.4412391            13.37876886          1.499320303
## 6            4.4412391            13.37876886          1.499320303
##         misc_param_16                        misc_param_17
## 1 bulk density g cm-3 Bray Extractable P, mg kg-1 dry soil
## 2         1.120671355                                   42
## 3         1.120671355                                   42
## 4         1.120671355                                   42
## 5         1.120671355                                   42
## 6         1.120671355                                   42
##        misc_param_18                         misc_param_2
## 1 AMF colonization % AP Activity (nmol/h/g dry aggregate)
## 2         0.96460177                          3817.935026
## 3         0.96460177                          2707.967989
## 4         0.96460177                          3937.288068
## 5         0.96460177                          2347.249447
## 6         0.96460177                           3322.74954
##                           misc_param_3
## 1 BG Activity (nmol/h/g dry aggregate)
## 2                          347.4823871
## 3                          484.6664472
## 4                           452.920951
## 5                          326.0037702
## 6                          599.6004109
##                           misc_param_4
## 1 BX Activity (nmol/h/g dry aggregate)
## 2                          68.18962193
## 3                          77.18980431
## 4                          87.72808269
## 5                          71.67988711
## 6                          91.97449035
##                           misc_param_5
## 1 CB Activity (nmol/h/g dry aggregate)
## 2                          31.71779153
## 3                          44.92794971
## 4                          41.63592794
## 5                          40.65869535
## 6                          63.79000562
##                            misc_param_6
## 1 NAG Activity (nmol/h/g dry aggregate)
## 2                           60.99716163
## 3                            48.0673614
## 4                           75.63453504
## 5                            42.7971232
## 6                           72.83207178
##                                           misc_param_7
## 1 sum C Activity (nmol/h/g dry aggregate) BG + BX + CB
## 2                                          447.3898006
## 3                                          606.7842012
## 4                                          582.2849617
## 5                                          438.3423526
## 6                                          755.3649069
##               misc_param_8    misc_param_9   MBC_MBN_meth   ExtC_ExtN_meth
## 1 MBC ugC g-1 dry soil CFE MBC gC m-2 CFE  MBC_MBN_method ExtC_ExtN_method
## 2              258.8184796     29.00504562      CFE_K2SO4 K2SO4_extraction
## 3              258.8184796     29.00504562      CFE_K2SO4 K2SO4_extraction
## 4              258.8184796     29.00504562      CFE_K2SO4 K2SO4_extraction
## 5              258.8184796     29.00504562      CFE_K2SO4 K2SO4_extraction
## 6              258.8184796     29.00504562      CFE_K2SO4 K2SO4_extraction
##      AMF_col_meth      misc_para19     root_depth     misc_para20
## 1  AMF_col_method root_biomassMgHa root_depth(cm) AMF_col_biomass
## 2 intersect_count           0.2197           0-15          0.2119
## 3 intersect_count           0.2197           0-15          0.2119
## 4 intersect_count           0.2197           0-15          0.2119
## 5 intersect_count           0.2197           0-15          0.2119
## 6 intersect_count           0.2197           0-15          0.2119
##   misc_para21 misc_para22       misc_para23           misc_para24
## 1     MBC:MBN      MWD_um prop_agg_fraction N2O_gas2011_umol_m2_h
## 2 16.91808229 3014.572915       0.513651935                0.4021
## 3 16.91808229 3014.572915       0.053514691                0.4021
## 4 16.91808229 3014.572915       0.193260625                0.4021
## 5 16.91808229 3014.572915        0.23957275                0.4021
## 6 16.91808229 3014.572915                 .                0.4021
##             misc_para25           misc_para26           misc_para27
## 1 CH4_gas2011_umol_m2_h N20_gas2012_umol_m2_h CO2_gas2011_umol_m2_s
## 2               -0.1892           0.122794592                  4.29
## 3               -0.1892           0.122794592                  4.29
## 4               -0.1892           0.122794592                  4.29
## 5               -0.1892           0.122794592                  4.29
## 6               -0.1892           0.122794592                  4.29
##             misc_para28
## 1 CO2_gas2012_umol_m2_h
## 2                     .
## 3                     .
## 4                     .
## 5                     .
## 6                     .

Loading necessary libraries

library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Warning: package 'ggplot2' was built under R version 3.3.2
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats
library(stringi)

Changing heading names

colnames(cobs_data)<-c("sample_Id" , "sample_month" , "sample_year" , "crop" , "sample_block" , "agg_frac" , "MGRAST_Id" , "agrochem_addition" , "crop_rot" , "land_use" , "veg_class" , "veg_class_meth" , "drain_class" , "extreme_event" , "FAO_class" , "fire_hist" , "soil_hor" , "soil_hor_meth" , "link_soil_method" , "soil_tax" , "soil_tax_meth" , "MGRAST_Id" , "micro_bm" , "micro_bm_meth" , "misc_param" , "pH" , "pH_meth" , "dna_mix" , "land_use_pre" , "land_use_pre_meth" , "sample_position" , "salinity_meth" , "sample_wt_dna" , "siev_size" , "slope_aspect" , "slope_grad" , "soil_type" , "soil_type_meth" , "store_cond" , "texture" , "texture_meth" , "till" , "total_N" , "total_N_meth" , "total_OC_meth" , "total_OC" , "soil_water" , "soil_water_meth" , "total_C" , "misc_param_1" , "MBN_dry" , "MBN_applied" , "Ext_C_dry" , "Ext_C_applied" , "Ext_C_N_dry" , "Ext_N_applied" , "Bulk_dense" , "Ext_P_dry" , "AMF_col" , "AP_act" , "BG_act" , "BX_act" , "CB_act" , "NAG_act" , "Sum_C_act" , "MBC_dry" , "MBC_applied" , "MBC_MBN_meth" , "Ext_C_Ext_N_meth" , "AMF_col_meth" , "root_bm" , "root_dep" , "AMF_col_bm" , "MBC:MBN" , "MWD" , "agg_frac_prop" , "N2O_2011" , "CH4_2011" , "N2O_2012" , "CO2_2011" , "CO2_2012")

In order to make the data more human-friendly, I changed the heading names. They are all now in a similar form and are more intuitive. These are the heading names which will be used within the data dictionary.
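Assigning all the names positionally works as long as the column order of the file never changes. As a hedged alternative, the sketch below renames by lookup instead, using two raw header names that appear in the printed data above (`misc_param.1` and `misc_param_16`). The `toy` frame and the `lookup` vector are illustrative assumptions, not part of the original workflow.

```r
# Toy frame with two raw header names taken from the printed data above.
toy <- data.frame(misc_param.1  = c("5.237405777", "4.352434158"),
                  misc_param_16 = c("1.120671355", "1.120671355"),
                  stringsAsFactors = FALSE)

# Hypothetical lookup table: raw name -> tidy name.
lookup <- c(misc_param.1 = "total_C", misc_param_16 = "Bulk_dense")

# Find each raw name and rename it; column order in the file no longer matters.
idx <- match(names(lookup), names(toy))
stopifnot(!anyNA(idx))            # stop if a raw name is missing from the data
names(toy)[idx] <- unname(lookup)

names(toy)
```

A lookup like this also doubles as a record of which raw header became which tidy name.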

Removing duplicate column & first row

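# Names that occur more than once (here, MGRAST_Id)
to_remove <- names(which(table(names(cobs_data)) > 1))
# Drop the first row (header descriptions) and every column with a duplicated name
cobs_updated <- cobs_data[-1, !(names(cobs_data) %in% to_remove)]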

After preparing the data with better-suited header names, the next step was to delete any repeated columns, since tidy data does not include duplicate columns. Two columns shared the name MGRAST_Id and held the exact same data. Note that the code above removes both copies, not only the second one. The first row was also deleted, because it held a description of each header and no data. Those descriptions are covered in the separate data dictionary document, so the row was unnecessary within the data set.
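The difference between dropping every copy of a duplicated name and keeping the first copy can be seen on a small example. This is a minimal sketch on a made-up frame (`toy` and its column names are assumptions); `check.names = FALSE` is needed so that base R lets two columns share a name.

```r
# Toy frame with a duplicated column name "a".
toy <- data.frame(a = 1:3, b = 4:6, a = 1:3, check.names = FALSE)

# Names that occur more than once.
dup_names <- names(which(table(names(toy)) > 1))

# Option 1: drop every column whose name is duplicated (both copies go).
drop_both <- toy[, !(names(toy) %in% dup_names), drop = FALSE]

# Option 2: keep the first copy and drop only the later repeats.
keep_first <- toy[, !duplicated(names(toy)), drop = FALSE]

names(drop_both)
names(keep_first)
```

Which option is appropriate depends on whether the duplicated column is still needed downstream.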

Summary to find blank/null columns

summary(cobs_updated)
##   sample_Id         sample_month       sample_year       
##  Length:120         Length:120         Length:120        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##      crop           sample_block         agg_frac        
##  Length:120         Length:120         Length:120        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##  agrochem_addition    crop_rot           land_use        
##  Length:120         Length:120         Length:120        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##   veg_class         veg_class_meth     drain_class       
##  Length:120         Length:120         Length:120        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##  extreme_event       FAO_class          fire_hist        
##  Length:120         Length:120         Length:120        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##    soil_hor         soil_hor_meth      link_soil_method  
##  Length:120         Length:120         Length:120        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##    soil_tax         soil_tax_meth        micro_bm        
##  Length:120         Length:120         Length:120        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##  micro_bm_meth       misc_param             pH           
##  Length:120         Length:120         Length:120        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##    pH_meth            dna_mix          land_use_pre      
##  Length:120         Length:120         Length:120        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##  land_use_pre_meth  sample_position    salinity_meth     
##  Length:120         Length:120         Length:120        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##  sample_wt_dna       siev_size         slope_aspect      
##  Length:120         Length:120         Length:120        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##   slope_grad         soil_type         soil_type_meth    
##  Length:120         Length:120         Length:120        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##   store_cond          texture          texture_meth      
##  Length:120         Length:120         Length:120        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##      till             total_N          total_N_meth      
##  Length:120         Length:120         Length:120        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##  total_OC_meth        total_OC          soil_water       
##  Length:120         Length:120         Length:120        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##  soil_water_meth      total_C          misc_param_1      
##  Length:120         Length:120         Length:120        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##    MBN_dry          MBN_applied         Ext_C_dry        
##  Length:120         Length:120         Length:120        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##  Ext_C_applied      Ext_C_N_dry        Ext_N_applied     
##  Length:120         Length:120         Length:120        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##   Bulk_dense         Ext_P_dry           AMF_col         
##  Length:120         Length:120         Length:120        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##     AP_act             BG_act             BX_act         
##  Length:120         Length:120         Length:120        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##     CB_act            NAG_act           Sum_C_act        
##  Length:120         Length:120         Length:120        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##    MBC_dry          MBC_applied        MBC_MBN_meth      
##  Length:120         Length:120         Length:120        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##  Ext_C_Ext_N_meth   AMF_col_meth         root_bm         
##  Length:120         Length:120         Length:120        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##    root_dep          AMF_col_bm          MBC:MBN         
##  Length:120         Length:120         Length:120        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##      MWD            agg_frac_prop        N2O_2011        
##  Length:120         Length:120         Length:120        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##    CH4_2011           N2O_2012           CO2_2011        
##  Length:120         Length:120         Length:120        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##    CO2_2012        
##  Length:120        
##  Class :character  
##  Mode  :character
table(cobs_updated$total_OC_meth)
## 
##     
## 120
empty <- numeric(0)
for(i in 1:ncol(cobs_updated)){
  if(sum(cobs_updated[, i] == "") == nrow(cobs_updated)) {
   empty <- c(empty, i)
  }
}
subset_cobs <- select(cobs_updated, -empty)

By looking at the raw data, I could tell there were a few columns which did not contain any data. In order to tidy the data set, I wanted to delete these. Using the summary() function together with table(), I confirmed that such columns held nothing but empty strings. I then built a vector, “empty”, holding the index of every column whose values were all empty strings, and used select() to drop those columns, leaving only columns with values.
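The same check can be written without an explicit loop. The sketch below is one possible base R variant on a made-up frame (`toy` and its columns are assumptions). Unlike the loop above, it also treats `NA` as empty, which is an added assumption about how missing values should be handled.

```r
# Toy frame: "blank" holds only empty strings, "value" is partly filled.
toy <- data.frame(id    = c("s1", "s2"),
                  blank = c("", ""),
                  value = c("3.1", ""),
                  stringsAsFactors = FALSE)

# TRUE for columns in which every entry is "" or NA.
is_empty <- vapply(toy, function(col) all(is.na(col) | col == ""), logical(1))

# Keep only the columns that contain at least one value.
toy_filled <- toy[, !is_empty, drop = FALSE]

names(toy_filled)
```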

Parsing columns

parsed_cobs <- subset_cobs %>%
  separate("texture", into = c("sand", "silt", "clay"), sep=",") %>%
  separate(sample_Id, into = c("plot_treatment", "agg_fraction", "date"), sep="-") 

Two columns in the data set each held several pieces of information that could be split into individual columns. Soil “texture” held percentages for sand, silt, and clay, so I placed each value in its own column. Likewise, the column “sample_Id” contained three different pieces of information, separated by hyphens. It was divided into plot treatment, aggregate fraction, and date for easier use in analysis.
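To show what `separate()` does to a single row, here is a small sketch on made-up values. The sample IDs and texture percentages below are hypothetical, since the real values are not printed in this section; only the delimiters (`-` and `,`) and the new column names come from the code above. `convert = TRUE` is an optional extra that turns the split pieces into numbers where possible.

```r
library(tidyr)

# Hypothetical rows; real IDs and texture values may look different.
toy <- data.frame(sample_Id = c("CC12-LM-July2012", "P21-SM-July2012"),
                  texture   = c("27,40,33", "25,45,30"),
                  stringsAsFactors = FALSE)

# Split texture on commas; convert = TRUE makes the new columns integer.
parsed <- separate(toy, texture, into = c("sand", "silt", "clay"),
                   sep = ",", convert = TRUE)

# Split the ID on hyphens into its three parts.
parsed <- separate(parsed, sample_Id,
                   into = c("plot_treatment", "agg_fraction", "date"),
                   sep = "-")

str(parsed)
```

The new columns replace the original one in place, which is why the three ID pieces end up as the first three columns.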

Deleting after parsing

parsed_cobs[2:3]<- list(NULL)

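After splitting the columns into individual pieces, it was necessary to delete those which now repeated information. Columns 2 and 3 of the parsed data are the aggregate fraction and date pieces of the sample ID. They repeated data held elsewhere in the data set (for example in agg_frac), so they were deleted. The plot treatment column was kept, because it is split further in the next step.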
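Deleting by position relies on the column order staying fixed. A hedged sketch of the same deletion by name follows; the `toy` frame and its values are made up, while the column names match those created in the parsing step.

```r
# Toy frame with the three pieces of the sample ID plus one other column.
toy <- data.frame(plot_treatment = c("CC12", "P21"),
                  agg_fraction   = c("LM", "SM"),
                  date           = c("July2012", "July2012"),
                  sample_month   = c("July", "July"),
                  stringsAsFactors = FALSE)

# Remove the two redundant columns by name instead of by index.
toy[c("agg_fraction", "date")] <- list(NULL)

names(toy)
```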

Splitting column with regex

library(stringi)

parsed_cobs$plot <- unlist(stri_extract_all_regex(parsed_cobs$plot_treatment, pattern = "[0-9]+"))
parsed_cobs$treatment <- unlist(stri_extract_all_regex(parsed_cobs$plot_treatment, pattern = "[A-Z]+"))

The column “plot_treatment” actually contained two pieces of information. The plot was in numeric form (12,21,35,43,13,24,31,46,15,23,32) and the treatment was in the form of 1-2 letters representing continuous corn (CC), prairie (P), and fertilized prairie (PF). This made the column more challenging to parse, because there was no delimiter between the two pieces within the cell. So, I used regular expressions (regex) with the stringi function stri_extract_all_regex() to extract each component. First, I extracted the run of digits ("[0-9]+") and placed it in its own column, “plot”. Second, I repeated the call with a pattern for uppercase letters ("[A-Z]+") and placed that piece in its own column, “treatment”.
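The same two extractions can be sketched in base R, which may help readers who do not have stringi installed. The IDs below are hypothetical combinations of the plot numbers and treatment codes listed above; the real pairing and the order of letters and digits are assumptions.

```r
# Hypothetical plot_treatment values.
plot_treatment <- c("CC12", "P21", "PF35")

# First run of digits in each ID.
plot <- regmatches(plot_treatment, regexpr("[0-9]+", plot_treatment))

# First run of uppercase letters in each ID.
treatment <- regmatches(plot_treatment, regexpr("[A-Z]+", plot_treatment))

data.frame(plot_treatment, plot, treatment)
```

One caveat for the stringi version: `unlist(stri_extract_all_regex(...))` returns one element per match, so an ID containing two separate digit runs would produce a vector longer than the column. stringi's `stri_extract_first_regex()` returns exactly one value per ID and avoids that.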

Finalizing and selecting to form tidy subset

tidy_cobs <- select(parsed_cobs, plot, treatment, sample_month:CO2_2012)

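The final step in tidying this data was selecting the columns I wanted in the subset: plot and treatment first, followed by everything from sample_month through CO2_2012. This also drops the original plot_treatment column, which is no longer needed.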
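The summary output above shows every column stored as character, most likely because the first row of the file held text descriptions. As an optional follow-up, not part of the original workflow, numeric columns could be converted with base R's `type.convert()`. The sketch assumes that `"."` and the empty string mark missing values, as the printed data suggests. The `toy` frame reuses a few printed values, and its `crop` entries are made up.

```r
# Toy frame mimicking the all-character columns of the tidy data.
toy <- data.frame(agg_frac_prop = c("0.513651935", "0.053514691", "."),
                  CO2_2012      = c(".", ".", "."),
                  crop          = c("corn", "corn", "prairie"),
                  stringsAsFactors = FALSE)

# Convert each column to its natural type; "." and "" become NA.
typed <- type.convert(toy, na.strings = c(".", "", "NA"), as.is = TRUE)

str(typed)
```

A column that contains only missing values, such as `CO2_2012` here, comes back as a logical column of `NA`.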

## Loading data and setting working directory
library(dplyr)
library(ggplot2)
cobs_data<- read.csv("data/KBase_MGRast_Metadata_9May2013_EMB.csv", stringsAsFactors = FALSE)

## changing heading names

colnames(cobs_data)<-c("sample_Id" , "sample_month" , "sample_year" , "crop" , "sample_block" , "agg_frac" , "MGRAST_Id" , "agrochem_addition" , "crop_rot" , "land_use" , "veg_class" , "veg_class_meth" , "drain_class" , "extreme_event" , "FAO_class" , "fire_hist" , "soil_hor" , "soil_hor_meth" , "link_soil_method" , "soil_tax" , "soil_tax_meth" , "MGRAST_Id" , "micro_bm" , "micro_bm_meth" , "misc_param" , "pH" , "pH_meth" , "dna_mix" , "land_use_pre" , "land_use_pre_meth" , "sample_position" , "salinity_meth" , "sample_wt_dna" , "siev_size" , "slope_aspect" , "slope_grad" , "soil_type" , "soil_type_meth" , "store_cond" , "texture" , "texture_meth" , "till" , "total_N" , "total_N_meth" , "total_OC_meth" , "total_OC" , "soil_water" , "soil_water_meth" , "total_C" , "misc_param_1" , "MBN_dry" , "MBN_applied" , "Ext_C_dry" , "Ext_C_applied" , "Ext_C_N_dry" , "Ext_N_applied" , "Bulk_dense" , "Ext_P_dry" , "AMF_col" , "AP_act" , "BG_act" , "BX_act" , "CB_act" , "NAG_act" , "Sum_C_act" , "MBC_dry" , "MBC_applied" , "MBC_MBN_meth" , "Ext_C_Ext_N_meth" , "AMF_col_meth" , "root_bm" , "root_dep" , "AMF_col_bm" , "MBC:MBN" , "MWD" , "agg_frac_prop" , "N2O_2011" , "CH4_2011" , "N2O_2012" , "CO2_2011" , "CO2_2012")

## name to remove (duplicate column & first row)

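# Names that occur more than once (here, MGRAST_Id)
to_remove <- names(which(table(names(cobs_data)) > 1))
# Drop the first row (header descriptions) and every column with a duplicated name
cobs_updated <- cobs_data[-1, !(names(cobs_data) %in% to_remove)]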

## summary to find if blanks are blanks/nulls/etc.

summary(cobs_updated)
table(cobs_updated$total_OC_meth)
## 
##     
## 120
empty <- numeric(0)
for(i in 1:ncol(cobs_updated)){
  if(sum(cobs_updated[, i] == "") == nrow(cobs_updated)) {
    empty <- c(empty, i)
  }
}
subset_cobs <- select(cobs_updated, -empty)


## parsing columns
library(tidyverse)
parsed_cobs <- subset_cobs %>%
  separate("texture", into = c("sand", "silt", "clay"), sep=",") %>%
  separate(sample_Id, into = c("plot_treatment", "agg_fraction", "date"), sep="-") 
  
## deleting unnecessary columns after parsing
parsed_cobs[2:3] <- list(NULL)



## splitting column "plot_treatment" using regex
library(stringi)
parsed_cobs$plot <- unlist(stri_extract_all_regex(parsed_cobs$plot_treatment, pattern = "[0-9]+"))
parsed_cobs$treatment <- unlist(stri_extract_all_regex(parsed_cobs$plot_treatment, pattern = "[A-Z]+"))

tidy_cobs <- select(parsed_cobs, plot, treatment, sample_month:CO2_2012)
knitr::opts_chunk$set(echo = TRUE)
#creating data dictionary 
data_dictionary <- data.frame(
  name = c("plot", "treatment", "sample_month", "sample_year", "crop" , "sample_block" , "agg_frac", "agrochem_addition", "crop_rot" , "land_use" , "veg_class" , "veg_class_meth" , "drain_class" , "extreme_event" , "FAO_class" , "fire_hist" , "soil_hor" , "soil_hor_meth" , "link_soil_method" , "soil_tax" , "soil_tax_meth", "pH" , "pH_meth" , "dna_mix" , "land_use_pre", "siev_size", "slope_grad" , "soil_type" , "soil_type_meth" , "store_cond" , "sand", "silt", "clay", "texture_meth" , "till" , "total_N" , "total_N_meth", "soil_water" , "soil_water_meth" , "total_C" , "misc_param_1" , "MBN_dry" , "MBN_applied" , "Ext_C_dry" , "Ext_C_applied" , "Ext_C_N_dry" , "Ext_N_applied" , "Bulk_dense" , "Ext_P_dry" , "AMF_col" , "AP_act" , "BG_act" , "BX_act" , "CB_act" , "NAG_act" , "Sum_C_act" , "MBC_dry" , "MBC_applied" , "MBC_MBN_meth" , "Ext_C_Ext_N_meth" , "AMF_col_meth" , "root_bm" , "root_dep" , "AMF_col_bm" , "MBC:MBN" , "MWD" , "agg_frac_prop" , "N2O_2011" , "CH4_2011" , "N2O_2012" , "CO2_2011" , "CO2_2012"),
  column = c("Plot", "Treatment", "Sample Month", "Sample Year", "Crop", "Sample Block", "Aggregate Fraction", "Agrochemical Addition", "Crop Rotation", "Current Land Use", "Vegetation Class", "Vegetation Class Method", "Drainage Class", "Extreme Events", "FAO Soil Class", "Fire History", "Soil Horizon", "Soil Horizon Method", "Link to Soil Horizon Method", "Soil Taxonomy", "Soil Taxonomy Method", "pH", "pH Method", "Presence and Number of Mixed DNA Extractions", "Previous Land Use", "Size of Sieve", "Slope Gradient", "Soil Type", "Soil Type Method", "Condition of Stored Soil Sample", "Sand", "Silt", "Clay", "Soil Texture Method", "Tillage Type", "Total Nitrogen", "Total Nitrogen Method", "Soil Water Content", "Soil Water Content Method", "Total Carbon", "Miscellaneous Parameter 1", "Microbial Biomass Nitrogen Dry", "Microbial Biomass Nitrogen Applied", "Extractable Carbon Dry", "Extractable Carbon Applied", "Extractable Carbon and Nitrogen Dry", "Extractable Nitrogen Applied", "Bulk Density", "Extractable Phosphorus Dry", "AMF Colonization", "Acid Phosphatase Activity", "$\\beta$-Glucosidase Activity", "$\\beta$-Xylosidase Activity", "Cellobiohydrolase Activity", "$\\beta$-N-acetylglucosaminidase Activity", "Sum Carbon Activity", "Microbial Biomass Carbon Dry", "Microbial Biomass Carbon Applied", "Microbial Carbon and Nitrogen Method", "Extractable Carbon and Nitrogen Analysis Method", "AMF Colonization Method", "Root Biomass", "Root Depth", "AMF Colony Biomass", "Microbial Carbon:Microbial Nitrogen Ratio", "Mean Weight Diameter", "Aggregate Fraction Proportion", "Nitrous Oxide Measurements 2011", "Methane Measurements 2011", "Nitrous Oxide Measurements 2012", "Carbon Dioxide Measurements 2011", "Carbon Dioxide Measurements 2012"),
  description = c("Plot sample was taken from", "Treatment number plot received", "Month the sample was taken", "Year the sample was taken", "Cropping system", "Experimental block", "Fraction of aggregate", "Addition of fertilizers, pesticides, amount and time of applications", "Whether or not crop is rotated, and if yes, rotation schedule", "Current land use of sample site", "Vegetation classification", "Method of vegetation classification", "Drainage classification", "Unusual physical events that may have affected microbial populations", "Soil classification from the FAO World Reference Database for Soil Resources", "Historical and/or physical evidence of fire", "Layer of soil which exhibits physical characteristics that vary compared to the layers above and beneath", "Method used in determining the horizon", "Web link to digitized soil maps or other soil classification information", "Soil classification based on local soil classification system", "Method used in determining the local soil classification", "pH measurement", "Method for determining pH", "Number of mixed DNA extractions (string(???)) Only one DNA extraction used (no)", "Previous land use and dates of recorded history", "Size of sieve in millimeters", "The angle between ground surface and a horizontal line in percent", "Soil series name or other lower-level classification", "Method used in determining soil series name or other lower-level classification", "How and for how long the soil sample was stored before DNA extraction", "Relative proportion of percent sand (50 um to 2 mm)", "Relative proportion of silt (2 um to 50 um)", "Relative proportion of clay (<2 um)", "Method used in determining soil texture", "Method used for tilling", "Total nitrogen content of the soil in grams of Nitrogen per kilogram of soil", "Method used in determining the total Nitrogen", "Water content (g/g or cm3/cm3) in soil", "Method used in determining the water content of soil", "Total Carbon in mg/g", "Any other measurement performed or parameter collected that is not listed here", "Microbial biomass Nitrogen in micrograms of Nitrogen per gram of dry soil (CFE)", "Microbial biomass Nitrogen in grams of Nitrogen per square meter", "Extractable Carbon in micrograms of Carbon per gram of dry soil", "Extractable Carbon in grams of Carbon per square meter", "Extractable Carbon in micrograms of Nitrogen per gram of dry soil", "Extractable Nitrogen in grams of Carbon per square meter", "Bulk density in grams per cubic centimeter", "Extractable Phosphorus using Bray's extraction method in milligrams per kilogram of dry soil", "Percent colonization by arbuscular mycorrhizal fungi (AMF)", "Acid Phosphatase (AP) Activity in nanomoles per hour per gram of dry aggregate", "$\\beta$-Glucosidase (BG) Activity in nanomoles per hour per gram of dry aggregate", "$\\beta$-Xylosidase (BX) Activity in nanomoles per hour per gram of dry aggregate", "Cellobiohydrolase (CB) Activity in nanomoles per hour per gram of dry aggregate", "$\\beta$-N-acetylglucosaminidase (NAG) Activity in nanomoles per hour per gram of dry aggregate", "Sum of Carbon activity (BG + BX + CB) in nanomoles per hour per gram of dry aggregate", "Microbial biomass Carbon in micrograms of Carbon per gram of dry soil (CFE)", "Microbial biomass Carbon in grams of Carbon per square meter (CFE)", "Method for microbial biomass Carbon and Nitrogen analysis", "Method for extractable Carbon and Nitrogen analysis", "Method for measuring colonization by arbuscular mycorrhizal fungi", "Amount of root biomass in megagrams per hectare", "Depth of roots in centimeters", "Biomass of Community of arbuscular mycorrhizal fungi", "Ratio of microbial biomass carbon to nitrogen", "Mean Weight Diameter in micrometers", "Proportion of aggregate fraction", "Nitrous Oxide Measurements 2011 in micromoles per square meter per hour", "Methane Measurements 2011 in micromoles per square meter per hour", "Nitrous Oxide Measurements 2012 in micromoles per square meter per hour", "Carbon Dioxide Measurements 2011 in micromoles per square meter per second", "Carbon Dioxide Measurements 2012 in micromoles per square meter per hour"))
knitr::kable(data_dictionary)
| name | column | description |
|---|---|---|
| plot | Plot | Plot sample was taken from |
| treatment | Treatment | Treatment number plot received |
| sample_month | Sample Month | Month the sample was taken |
| sample_year | Sample Year | Year the sample was taken |
| crop | Crop | Cropping system |
| sample_block | Sample Block | Experimental block |
| agg_frac | Aggregate Fraction | Fraction of aggregate |
| agrochem_addition | Agrochemical Addition | Addition of fertilizers, pesticides, amount and time of applications |
| crop_rot | Crop Rotation | Whether or not crop is rotated, and if yes, rotation schedule |
| land_use | Current Land Use | Current land use of sample site |
| veg_class | Vegetation Class | Vegetation classification |
| veg_class_meth | Vegetation Class Method | Method of vegetation classification |
| drain_class | Drainage Class | Drainage classification |
| extreme_event | Extreme Events | Unusual physical events that may have affected microbial populations |
| FAO_class | FAO Soil Class | Soil classification from the FAO World Reference Database for Soil Resources |
| fire_hist | Fire History | Historical and/or physical evidence of fire |
| soil_hor | Soil Horizon | Layer of soil which exhibits physical characteristics that vary compared to the layers above and beneath |
| soil_hor_meth | Soil Horizon Method | Method used in determining the horizon |
| link_soil_method | Link to Soil Horizon Method | Web link to digitized soil maps or other soil classification information |
| soil_tax | Soil Taxonomy | Soil classification based on local soil classification system |
| soil_tax_meth | Soil Taxonomy Method | Method used in determining the local soil classification |
| pH | pH | pH measurement |
| pH_meth | pH Method | Method for determining pH |
| dna_mix | Presence and Number of Mixed DNA Extractions | Number of mixed DNA extractions (string(???)) Only one DNA extraction used (no) |
| land_use_pre | Previous Land Use | Previous land use and dates of recorded history |
| siev_size | Size of Sieve | Size of sieve in millimeters |
| slope_grad | Slope Gradient | The angle between ground surface and a horizontal line in percent |
| soil_type | Soil Type | Soil series name or other lower-level classification |
| soil_type_meth | Soil Type Method | Method used in determining soil series name or other lower-level classification |
| store_cond | Condition of Stored Soil Sample | How and for how long the soil sample was stored before DNA extraction |
| sand | Sand | Relative proportion of percent sand (50 um to 2 mm) |
| silt | Silt | Relative proportion of silt (2 um to 50 um) |
| clay | Clay | Relative proportion of clay (<2 um) |
| texture_meth | Soil Texture Method | Method used in determining soil texture |
| till | Tillage Type | Method used for tilling |
| total_N | Total Nitrogen | Total nitrogen content of the soil in grams of Nitrogen per kilogram of soil |
| total_N_meth | Total Nitrogen Method | Method used in determining the total Nitrogen |
| soil_water | Soil Water Content | Water content (g/g or cm3/cm3) in soil |
| soil_water_meth | Soil Water Content Method | Method used in determining the water content of soil |
| total_C | Total Carbon | Total Carbon in mg/g |
| misc_param_1 | Miscellaneous Parameter 1 | Any other measurement performed or parameter collected that is not listed here |
| MBN_dry | Microbial Biomass Nitrogen Dry | Amount of Nitrogen in the microbial biomass in micrograms of Nitrogen per 1 gram of dry soil |
| MBN_applied | Microbial Biomass Nitrogen Applied | CFE MBN gN m-2 |
| Ext_C_dry | Extractable Carbon Dry | Extractable Carbon in micrograms of Carbon per gram of dry soil |
| Ext_C_applied | Extractable Carbon Applied | Extractable Carbon in grams of Carbon per square meter |
| Ext_C_N_dry | Extractable Carbon and Nitrogen Dry | Extractable Carbon in micrograms of Nitrogen per gram of dry soil |
| Ext_N_applied | Extractable Nitrogen Applied | Extractable Nitrogen in grams of Carbon per square meter |
| Bulk_dense | Bulk Density | Bulk density in grams per cubic centimeter |
| Ext_P_dry | Extractable Phosphorous Dry | Extractable Phosphorous using Bray’s extraction method in milligrams per kilogram of dry soil |
| AMF_col | AMF Colony | Percentage of Community of arbuscular mycorrhizal fungi |
| AP_act | Acid Phosphatase Activity | Acid Phosphatase (AP) Activity in nanomoles of hydrogen per gram of dry aggregate |
| BG_act | \(\beta\)-Glucosidase Activity | \(\beta\)-Glucosidase (BG) Activity in nanomoles of hydrogen per gram of dry aggregate |
| BX_act | \(\beta\)-Xylosidase Activity | \(\beta\)-Xylosidase (BX) Activity in nanomoles of hydrogen per gram of dry aggregate |
| CB_act | Cellobiohydrolase Activity | Cellobiohydrolase (CB) Activity in nanomoles of hydrogen per gram of dry aggregate |
| NAG_act | \(\beta\)-N-acetylglucosaminidase Activity | \(\beta\)-N-acetylglucosaminidase (NAG) Activity in nanomoles of hydrogen per gram of dry aggregate |
| Sum_C_act | Sum Carbon Activity | Sum of Carbon activity in nanomoles of hydrogen per gram of dry aggregate |
| MBC_dry | Microbial Biomass Carbon Dry | BG + BX + CB MBC ugC g-1 dry soil CFE |
| MBC_applied | Microbial Biomass Carbon Applied | MBC gC m-2 CFE |
| MBC_MBN_meth | Microbial Carbon and Nitrogen Method | Method for Carbon and Nitrogen analysis in microbial |
| Ext_C_Ext_N_meth | Extractable Carbon and Nitrogen Analysis Method | Method for extractable Carbon and Nitrogen analysis |
| AMF_col_meth | AMF Colony Method | Method for Community of arbuscular mycorrhizal fungi |
| root_bm | Root Biomass | Amount of root biomass in megagrams per hectare |
| root_dep | Root Depth | Depth of roots in centimeters |
| AMF_col_bm | AMF Colony Biomass | Biomass of Community of arbuscular mycorrhizal fungi |
| MBC:MBN | Microbial Carbon:Microbial Nitrogen Ratio | Ratio of microbial biomass carbon to nitrogen |
| MWD | Mean Weight Diameter | Mean Weight Diameter in micromoles |
| agg_frac_prop | Aggregate Fraction | Proportion of aggregate fraction |
| N2O_2011 | Nitrous Oxide Measurements 2011 | Nitrous Oxide Measurements 2011 in micromoles per square meter |
| CH4_2011 | Methane Measurements 2011 | Methane Measurements 2011 in micromoles per square meter |
| N2O_2012 | Nitrous Oxide Measurements 2012 | Nitrous Oxide Measurements 2012 in micromoles per square meter |
| CO2_2011 | Carbon Dioxide Measurements 2011 | Carbon Dioxide Measurements 2011 in micromoles per square meter |
| CO2_2012 | Carbon Dioxide Measurements 2012 | Carbon Dioxide Measurements 2012 in micromoles per square meter |
library(dplyr)
library(ggplot2)
library(GGally)
## Warning: package 'GGally' was built under R version 3.3.2
## 
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
## 
##     nasa
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.4-1

Make the tidy metadata file

colnames(cobs_data)<-c("sample_Id" , "sample_month" , "sample_year" , "crop" , "sample_block" , "agg_frac" , "MGRAST_Id" , "agrochem_addition" , "crop_rot" , "land_use" , "veg_class" , "veg_class_meth" , "drain_class" , "extreme_event" , "FAO_class" , "fire_hist" , "soil_hor" , "soil_hor_meth" , "link_soil_method" , "soil_tax" , "soil_tax_meth" , "MGRAST_Id" , "micro_bm" , "micro_bm_meth" , "misc_param" , "pH" , "pH_meth" , "dna_mix" , "land_use_pre" , "land_use_pre_meth" , "sample_position" , "salinity_meth" , "sample_wt_dna" , "siev_size" , "slope_aspect" , "slope_grad" , "soil_type" , "soil_type_meth" , "store_cond" , "texture" , "texture_meth" , "till" , "total_N" , "total_N_meth" , "total_OC_meth" , "total_OC" , "soil_water" , "soil_water_meth" , "total_C" , "misc_param_1" , "MBN_dry" , "MBN_applied" , "Ext_C_dry" , "Ext_C_applied" , "Ext_C_N_dry" , "Ext_N_applied" , "Bulk_dense" , "Ext_P_dry" , "AMF_col" , "AP_act" , "BG_act" , "BX_act" , "CB_act" , "NAG_act" , "Sum_C_act" , "MBC_dry" , "MBC_applied" , "MBC_MBN_meth" , "Ext_C_Ext_N_meth" , "AMF_col_meth" , "root_bm" , "root_dep" , "AMF_col_bm" , "MBC:MBN" , "MWD" , "agg_frac_prop" , "N2O_2011" , "CH4_2011" , "N2O_2012" , "CO2_2011" , "CO2_2012")
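# Find column names that occur more than once (here the repeated "MGRAST_Id")
to_remove <- names(which(table(names(cobs_data)) > 1))
# Drop those columns and the first row; %in% also works if several names are duplicated
cobs_updated <- cobs_data[-1, !(names(cobs_data) %in% to_remove)]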
summary(cobs_updated)
##   sample_Id         sample_month       sample_year       
##  Length:120         Length:120         Length:120        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##      crop           sample_block         agg_frac        
##  Length:120         Length:120         Length:120        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##  agrochem_addition    crop_rot           land_use        
##  Length:120         Length:120         Length:120        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##   veg_class         veg_class_meth     drain_class       
##  Length:120         Length:120         Length:120        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##  extreme_event       FAO_class          fire_hist        
##  Length:120         Length:120         Length:120        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##    soil_hor         soil_hor_meth      link_soil_method  
##  Length:120         Length:120         Length:120        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##    soil_tax         soil_tax_meth        micro_bm        
##  Length:120         Length:120         Length:120        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##  micro_bm_meth       misc_param             pH           
##  Length:120         Length:120         Length:120        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##    pH_meth            dna_mix          land_use_pre      
##  Length:120         Length:120         Length:120        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##  land_use_pre_meth  sample_position    salinity_meth     
##  Length:120         Length:120         Length:120        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##  sample_wt_dna       siev_size         slope_aspect      
##  Length:120         Length:120         Length:120        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##   slope_grad         soil_type         soil_type_meth    
##  Length:120         Length:120         Length:120        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##   store_cond          texture          texture_meth      
##  Length:120         Length:120         Length:120        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##      till             total_N          total_N_meth      
##  Length:120         Length:120         Length:120        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##  total_OC_meth        total_OC          soil_water       
##  Length:120         Length:120         Length:120        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##  soil_water_meth      total_C          misc_param_1      
##  Length:120         Length:120         Length:120        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##    MBN_dry          MBN_applied         Ext_C_dry        
##  Length:120         Length:120         Length:120        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##  Ext_C_applied      Ext_C_N_dry        Ext_N_applied     
##  Length:120         Length:120         Length:120        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##   Bulk_dense         Ext_P_dry           AMF_col         
##  Length:120         Length:120         Length:120        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##     AP_act             BG_act             BX_act         
##  Length:120         Length:120         Length:120        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##     CB_act            NAG_act           Sum_C_act        
##  Length:120         Length:120         Length:120        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##    MBC_dry          MBC_applied        MBC_MBN_meth      
##  Length:120         Length:120         Length:120        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##  Ext_C_Ext_N_meth   AMF_col_meth         root_bm         
##  Length:120         Length:120         Length:120        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##    root_dep          AMF_col_bm          MBC:MBN         
##  Length:120         Length:120         Length:120        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##      MWD            agg_frac_prop        N2O_2011        
##  Length:120         Length:120         Length:120        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##    CH4_2011           N2O_2012           CO2_2011        
##  Length:120         Length:120         Length:120        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##    CO2_2012        
##  Length:120        
##  Class :character  
##  Mode  :character
table(cobs_updated$total_OC_meth)
## 
##     
## 120
# Flag columns in which every value is an empty string (or NA), then drop them
empty <- vapply(cobs_updated, function(x) all(is.na(x) | x == ""), logical(1))
subset_cobs <- cobs_updated[, !empty]
library(tidyverse)
# Split texture into sand, silt and clay, and sample_Id into its three dash-separated parts
parsed_cobs <- subset_cobs %>%
 separate("texture", into = c("sand", "silt", "clay"), sep=",") %>%
 separate(sample_Id, into = c("plot_treatment", "agg_fraction", "date"), sep="-")
# Drop the agg_fraction and date pieces (columns 2 and 3)
parsed_cobs[2:3]<- list(NULL)
library(stringi)
# plot is the numeric part of plot_treatment, treatment is the upper-case letter part
parsed_cobs$plot <- stri_extract_first_regex(parsed_cobs$plot_treatment, pattern = "[0-9]+")
parsed_cobs$treatment <- stri_extract_first_regex(parsed_cobs$plot_treatment, pattern = "[A-Z]+")
tidy_cobs <- select(parsed_cobs, plot, treatment, sample_month:CO2_2012)
write.csv(tidy_cobs, "Tidy_Cobs.csv", row.names=FALSE)

Previous analysis

#Read in the abundance, meta and link tables
BG_abun=read.table("data/summary_counts/summary-count-21.tsv",header=T)
BX_abun=read.table("data/summary_counts/summary-count-37.tsv",header=T)
CB_abun=read.table("data/summary_counts/summary-count-91.tsv",header=T)
meta=read.csv("Tidy_Cobs.csv")
smplnk=read.csv("data/SampleLink4.csv")

#Transpose each data frame and modify rownames so they all match
BG.t=data.frame(t(BG_abun))
BG.t$row=row.names(BG.t)
BG.t$row=gsub("_R1", "", BG.t$row)
BG.t$row=gsub("_R2", "", BG.t$row)
BX.t=data.frame(t(BX_abun))
BX.t$row=row.names(BX.t)
BX.t$row=gsub("_R1", "", BX.t$row)
BX.t$row=gsub("_R2", "", BX.t$row)
CB.t=data.frame(t(CB_abun))
CB.t$row=row.names(CB.t)
CB.t$row=gsub("_R1", "", CB.t$row)
CB.t$row=gsub("_R2", "", CB.t$row)
meta$agg_frac<-gsub("micro", "Micro", meta$agg_frac)
meta$group<-paste(meta$treatment, meta$plot, sep="")
meta$myear<-paste(meta$sample_month, meta$sample_year, sep="")
meta$SampleName<-paste(meta$group, meta$agg_frac, meta$myear, sep="-")

#Merge transposed count tables with link tables
smplnkBG=merge(BG.t,smplnk,by.x="row",by.y="rast_file")
smplnkBX=merge(BX.t,smplnk,by.x="row",by.y="rast_file")
smplnkCB=merge(CB.t,smplnk,by.x="row",by.y="rast_file")
mergedataBG=merge(meta,smplnkBG, by.x = "SampleName", by.y = "SampleName")
mergedataBX=merge(meta,smplnkBX, by.x = "SampleName", by.y = "SampleName")
mergedataCB=merge(meta,smplnkCB, by.x = "SampleName", by.y = "SampleName")

#Make counts numeric instead of integers
cn<-c(names(mergedataBG[, 1:77]), "SoilFrac", "Crop")
mergedataBG[, !names(mergedataBG) %in% cn] = lapply(mergedataBG[, !names(mergedataBG) %in% cn], as.numeric)
mergedataBX[, !names(mergedataBX) %in% cn] = lapply(mergedataBX[, !names(mergedataBX) %in% cn], as.numeric)
mergedataCB[, !names(mergedataCB) %in% cn] = lapply(mergedataCB[, !names(mergedataCB) %in% cn], as.numeric)

#Sum gene counts for each metagenome
countBG = rowSums(mergedataBG[,77:6291])
countBX = rowSums(mergedataBX[,77:2532])
countCB = rowSums(mergedataCB[,77:1439])
mergedataBG$BG_counts<-countBG
mergedataBX$BX_counts<-countBX
mergedataCB$CB_counts<-countCB

#Merge gene count columns into one table
mergedata = mergedataBG
mergedata$BX_counts<-paste(mergedataBX$BX_counts)
mergedata$CB_counts<-paste(mergedataCB$CB_counts)
mergedata[,6299:6301] = lapply(mergedata[,6299:6301],as.numeric)
#standardize mergedata using decostand in Vegan package
mergedatastd <- decostand(mergedata[,c("BG_act","BX_act","CB_act","Sum_C_act","CB_counts","BX_counts","BG_counts")], "range")

#Figure 1
myvars <- c("BG_act","BX_act","CB_act","Sum_C_act","CB_counts","BX_counts","BG_counts")
subset = mergedatastd[myvars]
ggpairs(subset, title="Enzyme Abundance and Activity Correlations")

#Export figure 1 as .pdf

a1=ggpairs(subset, title="Enzyme Abundance and Activity Correlations")
pdf("images/correlation.pdf")
print(a1)
dev.off()
#Figure 2a
ggplot(mergedata, aes(x=BG_act, y = BG_counts)) + geom_boxplot() + facet_grid(Crop ~ SoilFrac) + xlab("Beta-glucosidase Activity") + ylab("Beta-glucosidase Gene Abundances") + labs(title="Beta-glucosidase Activity vs. Abundances")

#Figure 2b
ggplot(mergedata, aes(x=BX_act, y = BX_counts)) + geom_boxplot() + facet_grid(Crop ~ SoilFrac) + xlab("Beta-D-xylosidase Activity") + ylab("Beta-D-xylosidase Gene Abundances") +labs(title="Beta-D-xylosidase Activity vs. Abundances")

#Figure 2c
ggplot(mergedata, aes(x=CB_act, y = CB_counts)) + geom_boxplot() + facet_grid(Crop ~ SoilFrac) + xlab("1,4-beta-cellobiosidase Activity") + ylab("1,4-beta-cellobiosidase Gene Abundances") + labs(title="1,4-beta-cellobiosidase Activity vs. Abundances")

#Export Figure 2a-c as .pdf (pdf() takes width and height in inches, not pixels)

a2=ggplot(mergedata, aes(x=BG_act, y = BG_counts)) + geom_boxplot() + facet_grid(Crop ~ SoilFrac) + xlab("Beta-glucosidase Activity") + ylab("Beta-glucosidase Gene Abundances") + labs(title="Beta-glucosidase Activity vs. Abundances")
pdf("../images/Fig2BG.pdf", width = 7, height = 7)
print(a2)
dev.off()

b2=ggplot(mergedata, aes(x=BX_act, y = BX_counts)) + geom_boxplot() + facet_grid(Crop ~ SoilFrac) + xlab("Beta-D-xylosidase Activity") + ylab("Beta-D-xylosidase Gene Abundances") +labs(title="Beta-D-xylosidase Activity vs. Abundances")
pdf("../images/Fig2BX.pdf", width = 7, height = 7)
print(b2)
dev.off()

c2=ggplot(mergedata, aes(x=CB_act, y = CB_counts)) + geom_boxplot() + facet_grid(Crop ~ SoilFrac) + xlab("1,4-beta-cellobiosidase Activity") + ylab("1,4-beta-cellobiosidase Gene Abundances") + labs(title="1,4-beta-cellobiosidase Activity vs. Abundances")
pdf("../images/Fig2CB.pdf", width = 7, height = 7)
print(c2)
dev.off()
#Figure 3a-c (the y axis is enzyme activity, so it is labeled as activity)
ggplot(mergedata, aes(x=total_C, y = BG_act)) + geom_boxplot() + facet_grid(Crop ~ SoilFrac) + xlab("Total Carbon") + ylab("Beta-glucosidase Activity") + labs(title="Total Carbon vs. Beta-glucosidase Activity")

ggplot(mergedata, aes(x=total_C, y = CB_act)) + geom_boxplot() + facet_grid(Crop ~ SoilFrac) + xlab("Total Carbon") + ylab("1,4-beta-cellobiosidase Activity") + labs(title="Total Carbon vs. 1,4-beta-cellobiosidase Activity")

ggplot(mergedata, aes(x=total_C, y = BX_act)) + geom_boxplot() + facet_grid(Crop ~ SoilFrac) + xlab("Total Carbon") + ylab("Beta-D-xylosidase Activity") + labs(title="Total Carbon vs. Beta-D-xylosidase Activity")

#Export figures 3a-c as .png (png() takes width and height in pixels)
a3=ggplot(mergedata, aes(x=total_C, y = BG_act)) + geom_boxplot() + facet_grid(Crop ~ SoilFrac) + xlab("Total Carbon") + ylab("Beta-glucosidase Activity") + labs(title="Total Carbon vs. Beta-glucosidase Activity")
png("../images/Fig3BG.png", width = 640, height = 640)
print(a3)
dev.off()

b3=ggplot(mergedata, aes(x=total_C, y = BX_act)) + geom_boxplot() + facet_grid(Crop ~ SoilFrac) + xlab("Total Carbon") + ylab("Beta-D-xylosidase Activity") + labs(title="Total Carbon vs. Beta-D-xylosidase Activity")
png("../images/Fig3BX.png", width = 640, height = 640)
print(b3)
dev.off()

c3=ggplot(mergedata, aes(x=total_C, y = CB_act)) + geom_boxplot() + facet_grid(Crop ~ SoilFrac) + xlab("Total Carbon") + ylab("1,4-beta-cellobiosidase Activity") + labs(title="Total Carbon vs. 1,4-beta-cellobiosidase Activity")
png("../images/Fig3CB.png", width = 640, height = 640)
print(c3)
dev.off()

Closing thoughts

Once the count tables for all 3 genes were generated and the metadata was in tidy format, all of the tables had to be merged using common IDs from our sample link table, producing one table for data analysis and figures. First, the count tables were transposed with the t() function to flip the rows and columns so that they matched the format of the metadata and sample link tables. The gsub() function was used to strip the mismatching portions of the names (the _R1 and _R2 suffixes). Each enzyme's transposed count table was then merged with the sample link table, and the result was merged with the metadata, all using the merge() function. The lapply() function was used to convert the count columns to numeric for plotting. A column containing the total count of each gene was generated by summing across the rows of each merged table with the rowSums() function. Using the paste() function, the resulting columns were then added to one master table containing all the metadata and the total abundances of all 3 genes, ready for analysis.
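
As a rough illustration of that transpose, clean and merge sequence, here is a minimal sketch on toy data. The file names, sample names and values are invented for the example and are not taken from the COBS tables; only the column names row, rast_file and SampleName follow the code above.

```r
# Toy count table: rows are genes, columns are sequencing files (hypothetical names)
counts <- data.frame(
  file1_R1 = c(3, 0, 5),
  file2_R1 = c(1, 2, 0),
  row.names = c("geneA", "geneB", "geneC")
)

# Hypothetical link table mapping file names to sample names
link <- data.frame(
  rast_file = c("file1", "file2"),
  SampleName = c("CC12-Micro-July2012", "P13-Micro-July2012"),
  stringsAsFactors = FALSE
)

# Hypothetical metadata, one row per sample
meta <- data.frame(
  SampleName = c("CC12-Micro-July2012", "P13-Micro-July2012"),
  BG_act = c(10.5, 7.2),
  stringsAsFactors = FALSE
)

# Transpose so that samples are rows, then strip the read suffix from the IDs
counts_t <- data.frame(t(counts))
counts_t$row <- sub("_R[12]$", "", row.names(counts_t))

# Merge counts with the link table, then with the metadata
linked <- merge(counts_t, link, by.x = "row", by.y = "rast_file")
merged <- merge(meta, linked, by = "SampleName")

# Total gene count per sample, selecting gene columns by name rather than position
gene_cols <- row.names(counts)
merged$total_counts <- rowSums(merged[, gene_cols])
```

Selecting the gene columns by name is a design choice for the sketch: it avoids hard-coded column positions, which shift whenever the metadata gains or loses a column.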

The gene counts in all subsequent steps must be standardized by the abundance of a known housekeeping gene for the numbers to be meaningful. We did not have this information, so while the figures provide a framework for inserting the standardized gene counts, they are currently not accurate. The ggpairs() function in the GGally package was used to generate a correlation matrix for all 3 enzyme activities and all 3 gene counts. The ggplot() function was used to create boxplots of each enzyme's activity vs. gene counts, all faceted by crop type and soil fraction. It was also used to create boxplots of each enzyme's activity vs. the total carbon measured in each sample, again faceted by crop type and soil fraction.
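
One way the housekeeping-gene standardization could be slotted in, once such counts are available, is sketched below. The hk_counts column and all values are hypothetical; we did not have this information, so this is an outline under that assumption and not the method used for the figures.

```r
# Hypothetical data: raw gene counts and housekeeping-gene counts per metagenome
toy <- data.frame(
  BG_counts = c(120, 80, 200),
  hk_counts = c(40, 20, 100)  # assumed column; not part of the COBS tables
)

# Express each gene count relative to the housekeeping gene in the same sample
toy$BG_per_hk <- toy$BG_counts / toy$hk_counts

# Optional scaling to [0, 1] with (x - min) / (max - min), the same
# transformation that decostand(x, "range") applies to each column
rng <- range(toy$BG_per_hk)
toy$BG_scaled <- (toy$BG_per_hk - rng[1]) / (rng[2] - rng[1])
```

The ratio column would then replace the raw BG_counts, BX_counts and CB_counts columns before the correlation matrix and boxplots are drawn.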

Additionally, we are missing metagenome information for aggregate fractions in the CC (corn) and P (prairie) treatments, which is why box plots are missing from the outputs for those two treatments.
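
A quick way to see which treatment and fraction combinations lack data is to cross-tabulate the two faceting variables. The sketch below uses invented rows; the PF treatment code and the fraction labels are placeholders, and only the Crop and SoilFrac column names come from the plotting code above.

```r
# Hypothetical sample listing: one row per available metagenome
toy <- data.frame(
  Crop = c("CC", "P", "PF", "PF", "PF"),
  SoilFrac = c("Micro", "Micro", "Micro", "SM", "LM")
)

# Count samples in every Crop x SoilFrac cell, including empty cells
coverage <- table(toy$Crop, toy$SoilFrac)

# List the combinations with no samples; these are the facets that stay empty
missing <- as.data.frame(coverage)
missing <- missing[missing$Freq == 0, ]
```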

Thank you for reading!